1. INTRODUCTION

While the global economy has focused on services rather than products, technological advancements
have kept pace. Because of the wide range of electronic platforms that offer services, information technology
(IT) has become a vital part of our daily lives [1]. Many people use these platforms for leisure, shopping, and
other activities. Every company now has a collection of applications that have evolved due to digitization. A
large and complicated IT infrastructure is needed to support this product line [2]. These advances demonstrate
that IT support systems are critical to an organization's support operations. In contrast, huge organizations spend
millions of dollars on commercial text classification algorithms for small enterprises. IT company workers face
various difficulties, including problems with buildings and infrastructure, software, and HR. Employees of an
organization use the IT service desk or helpdesk, which is often accessible over the Internet, to report an
issue [3]. The problem tickets are then assigned to the relevant domain expert group or service desk
representative based on the ticket category.

Ticket category, priority, and severity are just a few of the structured fields in web-based IT service
desk solutions [4]. A free-form field called "ticket description" allows the user to submit a description of the
ticket in their own language. When creating trouble tickets, employees manually select the problem's category,
priority, and severity and enter its description in standard English. Manual selection of the ticket category by
the end user may lead to an incorrect ticket classification because it depends on the user's impression of the
problem and on whether the user has registered the issue in the relevant category [5]. When tickets are
incorrectly categorized, they are sent to the wrong resolution group, which delays the resolution of the
issue. Conventional service desk systems work best with well-structured datasets [6]. A variety of
machine-learning approaches can be used to build an automatic ticket classification system that addresses
these issues. For example, to categorize a service desk ticket, an automated ticket classifier analyses the end
user's ticket description in natural language; both supervised and unsupervised machine learning approaches
can be used to build ticket classifier models [7]. Furthermore, when the label or category of historical ticket
data is known, classifier models can be constructed using supervised machine learning techniques such as
classification algorithms [8]. Therefore, this paper proposes a machine-learning-based classification of
IT tickets.


2. RELATED WORK 

If a ticket is generated automatically, the information will most likely be presented as free text.
Harun et al. mentioned that Paramesh et al. [9] created a strategy for constructing an automated service desk
ticket classifier system by analyzing data from IT infrastructure helpdesks. Traditional supervised
machine-learning methods were used to construct the classification models [10], and a comprehensive
investigation was carried out into the methods that can be used to deal with undesirable, imbalanced, or
wrongly labeled data. The convolutional neural network performed significantly better than the other
classification models examined [11]. Machine learning and natural language processing techniques have been
utilized during system development [12]. Tickets have been analyzed by manual evaluation using n-gram
analysis and contextual mining [13]; the resulting model had an error rate of only 1.4% when classifying each
ticket into the appropriate root cause category. Trouble miner was developed to sort trouble tickets according
to their underlying causes [14]; according to the results of that study, most tickets are caused by disruptions
in the network cables and routers. Regression and classification models have been constructed to predict
resolution times [15]. In that work, the fields that held text data were eliminated because the text had to be
entered by a human every time. A text field is also included in this thesis; however, its text is produced by a
machine rather than by a person, so the text field is mined for helpful information. For classification, the
resolution time was divided into three categories, and the resulting model had an accuracy of approximately
74.5%. For regression, the artificial neural network had the lowest mean absolute error, 24.8 hours [16].
Sample et al. mention that Lofgren [17] employed data mining and machine learning methods to determine the
underlying cause of network issues. This allowed him to provide engineers with actionable recommendations
and, as a result, reduce the time spent on troubleshooting. The model had an accuracy of up to 90% when
predicting the most prevalent root cause and only 70% when discriminating between up to 20 different root
causes [18].


3. METHOD 

The predictive models, feature ranking, and feature selection techniques were applied to the dataset both
with and without the body attribute [19]. Preprocessing was carried out to convert the textual data in the body
attribute to numerical data using a count vectorizer [20]. After the conversion, all the data were normalized
into the range 0 to 1. Feature ranking was then carried out to understand which features are of utmost
importance, and feature selection was used to improve the efficiency of the predictive models, namely the
support vector machine classifier (SVM/SVC), Gaussian naive Bayes, decision trees, logistic regression, and
k-nearest neighbours (KNN). The flowchart of the proposed approach is illustrated in Figure 1.
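
The preprocessing step described above (token counts followed by scaling to the range 0 to 1) can be
illustrated with a minimal pure-Python sketch. It is not the pipeline used in this study: the function names
and the three ticket bodies are hypothetical, and whitespace tokenization is an assumption made for brevity.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct lowercase token to a column index (sorted for determinism)."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def count_vectorize(documents, vocabulary):
    """Turn each document into a row of raw token counts, one column per vocabulary token."""
    rows = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows


def min_max_normalize(rows):
    """Rescale every column to the range [0, 1]; constant columns become 0."""
    scaled_columns = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        span = hi - lo
        scaled_columns.append([(v - lo) / span if span else 0.0 for v in col])
    return [list(row) for row in zip(*scaled_columns)]


# Hypothetical ticket bodies, invented for illustration only.
bodies = ["printer not working", "vpn not connecting not stable", "printer offline"]
vocab = build_vocabulary(bodies)
counts = count_vectorize(bodies, vocab)
features = min_max_normalize(counts)
```

Each row of `features` is then a numerical representation of one ticket body that can be concatenated with the
other, already numerical, attributes.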


3.1. Dataset 

This dataset was obtained by applying the clustering and labeling mechanisms from our
previous study. The best-performing algorithm of that study, latent Dirichlet allocation-based (LDA-based)
topic prediction with 13 topics, was used as the target attribute for classification. The dataset
includes a total of 47,664 incidents initially taken from the ServiceNow platform [21]. Figure 1 and Table 1
show how the attributes in the dataset were used in this study. The Yes/No values in the usage column
indicate whether a particular feature was used.


3.2. Environment 

To conduct this research, we used the following experimental setup: the Python programming language
and Jupyter Notebook, running on an Intel i5 processor with 32 GB of RAM.

Figure 1. The flowchart of the proposed approach


4. METHODOLOGY 
4.1. Attributes in the data set 

The body attribute contains textual data, whereas all other fields contain numerical data. Hence, a 
separate analysis was conducted to identify the performance of classification algorithms using the body 
attribute [22]. The range for logistic regression is between 0 and 1, but the range for linear regression is 
unbounded. This is the primary distinction between the two types of regression [23]. In addition, in contrast to 
linear regression, logistic regression does not mandate the existence of a linear connection between the 
variables that serve as inputs and those that are analyzed as outputs. 
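
The difference in output range can be illustrated with a minimal sketch. The one-feature model below and its
weight are hypothetical values chosen only to show that the linear output grows without bound while the
logistic output stays between 0 and 1.

```python
import math


def linear_output(weights, bias, features):
    """Linear regression prediction: an unbounded weighted sum."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def logistic_output(weights, bias, features):
    """Logistic regression prediction: the same sum squashed into (0, 1)."""
    z = linear_output(weights, bias, features)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical one-feature model, chosen only for illustration.
weights, bias = [2.0], 0.0
for value in (-5.0, 0.0, 5.0):
    print(value, linear_output(weights, bias, [value]), logistic_output(weights, bias, [value]))
```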


Table 1. Attributes in the data set

Attributes        Description                                                                          Usage
Topic prediction  Topic prediction values ranging from 1 to 13                                         Yes
Body              Agent entry of the ticket description                                                7
Ticket type       Numerical value of either 0 or 1; 0 refers to email, and 1 relates to phone          Yes
Category          Numerical value ranging from 0 to 12                                                 Yes
Sub_category1     Numerical value ranging from 0 to 58                                                 Yes
Sub_category2     Numerical value ranging from 0 to 118                                                Yes
Business service  Numerical value ranging from 0 to 102                                                Yes
Urgency           Numerical value ranging from 0 to 3; 3 is the urgent ticket, 0 no urgency            Yes
Impact            Numerical value ranging from 0 to 4. 5 is the highest impact and 0 is the lowest     Yes

4.2. CNN 


The convolutional neural network (CNN) is a deep learning algorithm that takes input data and assigns
them tags and IDs based on its weights or parameters. These tags and IDs are used to differentiate the
characteristics of the features extracted from the data. It also requires very little pre-processing of the data,
as it classifies them by itself during the process and learns from them [24]. The functioning of the CNN
algorithm, shown in Figure 2, is similar to that of the human brain [25]. It consists of neurons arranged in
several layers that transform the extracted data and finally learn the features [26]. It also takes advantage of
the spatial and temporal features of the dataset to improve its functioning. It is mostly suited to image
datasets, as it takes advantage of the pixel information in the images. The layers used in the CNN algorithm
are summarized later in this section.


[Figure 2 blocks: Convolution, Pooling, Fully Connected Layer, Dropout]


Figure 2. The architecture of the CNN algorithm

Let the CNN have m blocks, and let $p_\theta^i$ denote the probability that a self-attention module (SAM)
is connected to the ith block [27]. The controller output is used to sample a realization $a \sim p_\theta$ of
the connection scheme, where $p_\theta = (p_\theta^1, p_\theta^2, \ldots, p_\theta^m)$. Let $x_l$ denote the
input of the lth block and $f_l(\cdot)$ the residual mapping; the output $x_{l+1}$ of the lth block is then


$$x_{l+1} = x_l + f_l(x_l) \tag{1}$$


— FULL SELF-ATTENTION (FULL-SA) NETWORK 

In (2), the SAM in the lth block is denoted $M(\cdot\,; W_l)$, where $W_l$ are its parameters. The attention
$M(f_l(x_l); W_l)$ is computed at the end of the feature extraction process [28] and is applied to the residual
output $f_l(x_l)$. Here $l = 1, \ldots, m$ and $\odot$ denotes element-wise multiplication. As (2) shows, the
computation cost and the number of parameters increase with the number of blocks m.


$$x_{l+1} = x_l + M(f_l(x_l); W_l) \odot f_l(x_l) \tag{2}$$


— CONNECTION SCHEME 
Assume that the CNN has m blocks. A sequence $a = (a_1, a_2, \ldots, a_m)$ indicates a connection
scheme, where $a_i = 1$ if the ith block is connected to a SAM and $a_i = 0$ otherwise. The scheme is
formulated in (3).


$$x_{l+1} = x_l + \big(a_l \cdot M(f_l(x_l); W_l) + (1 - a_l) \cdot \mathbf{1}\big) \odot f_l(x_l) \tag{3}$$


Here $\mathbf{1}$ denotes the all-one vector and l lies between 1 and m. If a is the all-one vector, the
scheme represents the full-SA network, and if a is the zero vector, it represents the original CNN. A reward
G(a) is given to each scheme a. The controller has the parameter set $\theta$, and $\eta$ is the learning rate
of the policy gradient.


$$R_\theta = G(a) \cdot \sum_{i=1}^{m} \log p_\theta^i, \qquad \theta \leftarrow \theta + \eta \cdot \nabla_\theta R_\theta \tag{4}$$


In this manner, the controller provides the probabilities from which schemes are sampled and is updated
according to the reward G. Searching for a good structure of G can help in finding a better connection
scheme [28]. A better reward G can be found by combining the connection ratio and the accuracy. The
subnetwork specified by a, sampled from the supernet, provides a validation accuracy $g_{val}$ that
contributes to the reward.
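
The connection scheme in (3) can be illustrated with a minimal sketch under strong simplifying assumptions.
The residual mapping and the attention module below are hypothetical stand-ins (a fixed scaling and a single
sigmoid gate), not the layers used in this work; only the way $a_l$ switches the attention on and off follows
the formula.

```python
import math
import random


def residual_mapping(x):
    """Hypothetical stand-in for f_l: a fixed element-wise scaling."""
    return [0.5 * v for v in x]


def attention(features):
    """Hypothetical stand-in for M(.; W_l): one sigmoid gate shared by all elements."""
    gate = 1.0 / (1.0 + math.exp(-sum(features) / len(features)))
    return [gate] * len(features)


def block_forward(x, a_l):
    """x_{l+1} = x_l + (a_l * M(f_l(x_l)) + (1 - a_l) * 1) * f_l(x_l), element-wise."""
    f = residual_mapping(x)
    m = attention(f)
    scale = [a_l * m_i + (1 - a_l) * 1.0 for m_i in m]
    return [x_i + s_i * f_i for x_i, s_i, f_i in zip(x, scale, f)]


def sample_scheme(probabilities, rng):
    """Draw a_i = 1 with probability p_i, independently for each block."""
    return [1 if rng.random() < p else 0 for p in probabilities]


def forward(x, scheme):
    """Pass x through one block per entry of the connection scheme."""
    for a_l in scheme:
        x = block_forward(x, a_l)
    return x


rng = random.Random(0)
scheme = sample_scheme([0.2, 0.8, 0.5], rng)  # hypothetical controller probabilities
output = forward([2.0, 4.0], scheme)
```

With every $a_l = 0$ the sketch reduces to the plain residual update in (1), and with every $a_l = 1$ it applies
the attention in every block as in (2).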


$$\frac{I_t(\text{CNN with SAMs}) - I_t(\text{Original CNN})}{I_t(\text{Original CNN})} \times 100\% \tag{5}$$

Here $I_t(\cdot)$ denotes the network's inference time. The batch size is defined to be between 50 and 1,000.
$T(x)$ is defined as a Lipschitz continuous, Lebesgue-integrable function on a d-dimensional compact set
K [29]. The overall subnetwork is characterized by the depth and width of its layers, and its distance from
$R_{full}(x, \theta_{full})$ is smaller than the constant $\epsilon_0$.


$$\int_K |R_{full}(x, \theta_{full}) - R_{sub}(x)|\,dx < \epsilon \tag{6}$$
$$\int_{\mathbb{R}^d} |f(x) - R(x)|\,dx \le \epsilon \tag{7}$$
$$\int_{\mathbb{R}^d} |g(x, \theta_0) - T|\,dx \le \epsilon \tag{8}$$


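With the skip connections, the formulas give $g(x, \theta_0) = f(x, \theta_1)$.


$$\int_{\mathbb{R}^d} |f(x, \theta_1) - T|\,dx = \int_{\mathbb{R}^d} |g(x, \theta_0) - T|\,dx < \epsilon \tag{9}$$
$$\omega_K(r) = \max_{x, y \in K,\ \|x - y\| \le r} |f(x) - f(y)| \tag{10}$$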


Let T(x) be a Lipschitz continuous function, that is, a function that satisfies


$$|T(x) - T(y)| \le L\,\|x - y\| \tag{11}$$


Then we have $\omega_K(r) \le L r = \epsilon / \mathrm{Vol}(K)$, with

$$r = \frac{\epsilon}{\mathrm{Vol}(K)\,L} \tag{12}$$

Here, if $r = \epsilon / (\mathrm{Vol}(K) \cdot L)$, then for $\epsilon \in (0, 1)$,

$$O\!\left(\frac{1}{r^d}\right) = O\!\left(\left(\frac{\mathrm{Vol}(K)\,L}{\epsilon}\right)^{d}\right) \le C\left(\frac{1}{\epsilon}\right)^{d} \tag{13}$$


The constant C in the equation comes from the lemma, by which there exists a CNN $R_{short}(x)$ whose width
lies within d, and the bound $C(1/\epsilon)^d$ also applies to it.


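$$\mathrm{dep}(R_{long}) = \mathrm{dep}(R_{full}) \tag{14}$$


It can be seen from the equation that d is greater than the width of $R_{long}$. It can also be stated that,
for $x \in K$ and $R_{short}(x)$,


$$\int_K |T(x) - R_{long}(x)|\,dx = \int_K |T(x) - R_{short}(x)|\,dx < \epsilon/2 \tag{15}$$
Then
\begin{align}
\int_K |R_{full}(x, \theta_{full}) - R_{long}(x)|\,dx &\le \int_K |T(x) - R_{long}(x)|\,dx \tag{16} \\
&\quad + \int_K |R_{full}(x, \theta_{full}) - T(x)|\,dx \tag{17} \\
&< \epsilon/2 + \epsilon/2 = \epsilon \tag{18}
\end{align}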


$R_{long}$ represents the CNN whose width is smaller than d, while the width of $R_{full}$ is
greater than d. The inequality is satisfied by $R_{full}$.


$$m^l = \mathrm{AVG}(X^l) = \frac{1}{H \cdot W} \sum_{h=1}^{H} \sum_{w=1}^{W} X^l[:, h, w] \tag{19}$$


The features are processed by the sigmoid function $\mathrm{sig}(z) = 1/(1 + e^{-z})$. Here the reduction
rate is denoted r and '//' denotes exact division; thus, the size of the hidden layer is C//r. The information
obtained from all the channels is fused using the rectified linear unit (ReLU) activation function [30].


$$[s_1^l; \cdots; s_C^l] = \mathrm{sig}\big(\mathrm{FC}([m_1^l; \cdots; m_C^l]; W_l)\big) \tag{20}$$
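
The average pooling in (19) and the gating in (20) can be sketched as follows. This is an illustrative
pure-Python version that assumes two fully connected layers with a ReLU between them and a hidden size of
C//r; the feature map and the weight matrices `w_down` and `w_up` are hypothetical values, not trained
parameters.

```python
import math


def average_pool(feature_map):
    """feature_map[c][h][w] -> per-channel mean over all spatial positions (eq. 19)."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def fully_connected(vector, weights):
    """Multiply a weight matrix (list of rows) by a vector; no bias term."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def channel_gates(pooled, w_down, w_up):
    """ReLU on the hidden layer of size C//r, then a sigmoid gate per channel (eq. 20)."""
    hidden = [max(0.0, h) for h in fully_connected(pooled, w_down)]
    return [1.0 / (1.0 + math.exp(-z)) for z in fully_connected(hidden, w_up)]


# Hypothetical feature map with C = 4 channels of size 2 x 2 and reduction rate r = 2.
feature_map = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[0.0, 0.0], [0.0, 0.0]],
    [[2.0, 2.0], [2.0, 2.0]],
    [[4.0, 0.0], [0.0, 0.0]],
]
w_down = [[0.5, 0.0, 0.0, 0.0], [0.0, 0.0, 0.5, 0.0]]       # (C//r) x C
w_up = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]    # C x (C//r)
pooled = average_pool(feature_map)
gates = channel_gates(pooled, w_down, w_up)
```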


The block-wise information is integrated through the long short-term memory (LSTM) network of the EIA
module. The average pooling output $m^l$ is passed to the LSTM together with the hidden state h. Zero
vectors are used for $h_0$ and $c_0$.


$$(h_l, c_l) = \mathrm{LSTM}([m_1^l; \cdots; m_C^l], h_{l-1}, c_{l-1}; W) \tag{21}$$


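G represents the number of groups, and each group contains C//G feature maps [31]. The feature
maps are thus grouped into tensors of size (C//G) x H x W, each denoted $Y^l$, which lies within $X^l$.


$$g_c = \mathrm{AVG}(Y_c^l) = \frac{1}{H \cdot W} \sum_{h=1}^{H} \sum_{w=1}^{W} Y^l[c, h, w] \tag{22}$$
The coefficient of importance for each position is computed from $g = [g_1; \cdots; g_{C//G}]$, where $1 \le c \le C//G$,

$$p_{hw} = g \cdot Y^l[:, h, w] \tag{23}$$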


The value $p_{hw}$ is normalized in the following steps.


$$\hat{p}_{hw} = \frac{p_{hw} - \mu}{\sigma + \epsilon} \tag{24}$$


Here $\mu$ and $\sigma^2$ are the mean and variance of the values $p_{hw}$, which can be calculated through
the following equation,


$$\mu = \frac{1}{H \cdot W} \sum_{h=1}^{H} \sum_{w=1}^{W} p_{hw}, \qquad \sigma^2 = \frac{1}{H \cdot W} \sum_{h=1}^{H} \sum_{w=1}^{W} (p_{hw} - \mu)^2 \tag{25}$$


For the group $Y^l$, an additional pair of parameters $(\gamma, \beta)$ is added to rescale and shift the
normalized values, and the attention produced by the spatial group-wise enhance (SGE) module for the
features $Y^l[:, h, w]$ is written as


$$\mathrm{sig}(\gamma \hat{p}_{hw} + \beta) \tag{26}$$
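
The steps in (22) to (26) can be sketched for a single group as follows. This is a minimal pure-Python
illustration, not the implementation used in this work: the function name, the default values of `gamma`,
`beta`, and `eps`, and the toy input are assumptions.

```python
import math


def group_attention(group, gamma=1.0, beta=0.0, eps=1e-5):
    """group[c][h][w] holds the C//G feature maps of one group; returns attention[h][w]."""
    channels, height, width = len(group), len(group[0]), len(group[0][0])
    n = height * width
    # Eq. (22): average every channel over all spatial positions.
    g = [sum(v for row in channel for v in row) / n for channel in group]
    # Eq. (23): dot product between g and the feature vector at each position.
    p = [[sum(g[c] * group[c][h][w] for c in range(channels)) for w in range(width)]
         for h in range(height)]
    # Eq. (25): mean and variance of the coefficients over all positions.
    flat = [v for row in p for v in row]
    mu = sum(flat) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in flat) / n)
    # Eq. (24) and (26): normalize, rescale with gamma and beta, apply the sigmoid.
    return [[1.0 / (1.0 + math.exp(-(gamma * (v - mu) / (sigma + eps) + beta))) for v in row]
            for row in p]


# Hypothetical group with one channel and two spatial positions.
attention = group_attention([[[0.0, 2.0]]])
```

Positions whose features agree more strongly with the group average receive an attention value above 0.5,
and a group with identical values at every position receives 0.5 everywhere.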


The full CNN algorithm is accelerated through the sparsity reward $g_{spa}$. The $g_{spa}$ term encourages
the network to achieve better results by generating connection schemes with fewer connections.


$$g_{spa} = 1 - \frac{\|a\|_0}{m} \tag{27}$$

More schemes can be explored in this process during the training iterations, and the random network
distillation (RND) bonus provides better convergence of the iterations. The difference in the output is
reduced by the RND.


$$G(a) = \lambda_1 \cdot g_{spa} + \lambda_2 \cdot g_{val} + \lambda_3 \cdot g_{rnd} \tag{28}$$
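
As a worked illustration of (27) and (28), the reward of a connection scheme can be computed as below. The
weights, the validation accuracy, and the RND bonus used in the example are hypothetical values, not
settings reported in this paper.

```python
def sparsity_reward(scheme):
    """Eq. (27): one minus the fraction of blocks connected to a SAM."""
    connected = sum(1 for a in scheme if a != 0)
    return 1.0 - connected / len(scheme)


def total_reward(scheme, g_val, g_rnd, lambdas=(1.0, 1.0, 1.0)):
    """Eq. (28): weighted sum of the sparsity, validation, and RND terms."""
    lambda_1, lambda_2, lambda_3 = lambdas
    return lambda_1 * sparsity_reward(scheme) + lambda_2 * g_val + lambda_3 * g_rnd


# Hypothetical scheme with four blocks, two of them connected to a SAM.
reward = total_reward([1, 0, 0, 1], g_val=0.9, g_rnd=0.0)
```

A scheme with no connections has a sparsity reward of 1, and the full-SA scheme has a sparsity reward of 0,
so the validation term has to justify every connection that is kept.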


The proximal policy optimization method is used for faster training and sampling of the connection
schemes. It also provides better efficiency in the utilization of the data [27]. The tuple is kept in a buffer after
updating the parameters $\theta$ and $\phi$. The layers of the network are summarized as follows.
Layer (type): conv2d_1 (Conv2D), conv2d_2 (Conv2D), max_pooling2d_1 (MaxPooling2D), dropout_1 (Dropout),
dropout_3 (Dropout), dense_2 (Dense).
Output shape: (None, 75, 100, 32), (None, 75, 100, 32), (None, 37, 50, 32), (None, 37, 50, 32), (None, 128).
Parameters: 896, 9248, 0, 0, 903.
Total params: 3,752,999. Trainable params: 3,752,999. Non-trainable params: 0.

The summary above lists the convolutional, pooling, and fully connected layers used in the algorithm [32].
The proposed algorithm is expected to provide better classification of IT tickets, and it needs to be confirmed
whether it is more efficient than other existing methods. Most of the existing methods considered
conventional classification algorithms, while this paper considers the CNN for efficient processing of the data.


4.3. Feature ranking and selection 

Feature selection automatically chooses the features that contribute most to the prediction variable or
output of interest, and it is applied to the features obtained by feature extraction [33]. Completing feature
selection before modeling the data has several advantages. It reduces overfitting: less redundant data gives
the model fewer opportunities to make decisions based on noise, which can improve its performance. In
addition, it reduces the time needed for training: with less data, the algorithms train more quickly [34].


4.4. Chi-square 

In statistics, the chi-square test is used to determine whether two events are independent. In feature
selection, we use it to determine whether the occurrence of a specific word and the occurrence of a particular
class are independent of one another [35]. The statistic is given in (29), where $O_i$ is the actual observation
and $E_i$ is the expectation. If the chi-square score for a feature is high, the null hypothesis $H_0$ of
independence should be rejected, and the occurrence of the feature and the class depend on one another [36].


$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} \tag{29}$$
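
The statistic in (29) can be computed from a contingency table as in the sketch below. The two tables are
hypothetical counts of tickets (rows: word present or absent, columns: ticket in the class or not), and the
sketch assumes every expected count is non-zero.

```python
def chi_square(observed):
    """Chi-square statistic of a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    score = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            # Expected count under independence of word and class.
            e_ij = row_totals[i] * col_totals[j] / total
            score += (o_ij - e_ij) ** 2 / e_ij
    return score


# Hypothetical counts: the word appears far more often inside the class than outside it.
dependent = chi_square([[30, 10], [20, 40]])
# Hypothetical counts: the word is spread evenly across both classes.
independent = chi_square([[10, 10], [20, 20]])
```

A feature selector that keeps the k best features would compute this score for every feature and retain the k
features with the highest scores.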


4.5. Recursive feature elimination (RFE) 

Recursive feature elimination (RFE) selects the features that best fit a model by eliminating the weakest
feature (or features) until the required number of features has been reached. The features are
ranked according to the model's coefficients or feature importance attributes. By iteratively deleting a small
number of features in each iteration of the loop, RFE attempts to remove any dependencies and collinearity
present in the model [37]. RFE requires the number of features to retain to be specified; however, it
is not always possible to know in advance how many features are valid. Therefore, cross-validation
is used with RFE to score several feature subsets and choose the collection of features with the highest score.
This allows the optimum features to be determined.
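
The elimination loop can be sketched in pure Python as follows. This is an illustration under stated
assumptions rather than the procedure used in this study: the model is a small logistic regression trained by
batch gradient descent, one feature is removed per iteration, the cross-validation step is omitted, and the
data and feature names are hypothetical.

```python
import math


def fit_logistic(X, y, epochs=200, lr=0.5):
    """Train logistic regression by batch gradient descent and return the feature weights."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for row, label in zip(X, y):
            z = bias + sum(w * v for w, v in zip(weights, row))
            error = 1.0 / (1.0 + math.exp(-z)) - label
            for j, v in enumerate(row):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights


def recursive_feature_elimination(X, y, names, n_keep):
    """Drop the feature with the smallest absolute weight until n_keep features remain."""
    active = list(range(len(names)))
    ranking = {}
    while len(active) > n_keep:
        subset = [[row[j] for j in active] for row in X]
        weights = fit_logistic(subset, y)
        weakest = min(range(len(active)), key=lambda k: abs(weights[k]))
        # Features eliminated earlier receive a larger (worse) rank.
        ranking[names[active[weakest]]] = len(active) - n_keep + 1
        del active[weakest]
    for j in active:
        ranking[names[j]] = 1  # every retained feature shares rank 1
    return ranking


# Hypothetical data: an all-zero column, a column equal to the label, and a noisy column.
names = ["constant", "signal", "weak"]
y = [0, 0, 0, 0, 1, 1, 1, 1]
X = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0],
     [0.0, 1.0, 0.0], [0.0, 1.0, 1.0], [0.0, 1.0, 1.0], [0.0, 1.0, 1.0]]
ranking = recursive_feature_elimination(X, y, names, n_keep=1)
```

Because the all-zero column never receives a gradient, its weight stays at zero and it is the first feature to
be eliminated.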


5. RESULT 

The category attribute in the dataset consists of 13 categories. All the tickets in the dataset are labeled
with the topic prediction results from our earlier research work. Tables 2 and 3 show the results of the
performance analysis. Table 2 shows the result of recursive feature elimination on the obtained dataset [38].
Logistic regression assigned rank 1 to the urgency, impact, and ticket type attributes [39]. On the other hand,
predictive algorithms such as SVC, Gaussian naive Bayes, and KNN were not applicable with RFE and are
hence denoted as NA.


Table 2. Feature rankings with and without body attribute using RFE

                FEATURE RANKING USING RFE WITH BODY         FEATURE RANKING USING RFE WITH BODY
Feature         Logistic    Random  Decision  CNN           Logistic    Random  Decision  CNN
                Regression  Forest  Trees                   Regression  Forest  Trees
Ticket Type     1           5       5         6             2           8       8         2p
Category        2           3       3         5             5           1       1         1
Subcategory 1   3           1       1         1             6           1       1         1
Subcategory 2   5           1       1         1             8           1       1         1
Business        4           1       1         1             7           1       1         1
Urgency         1           2       2         3             4           1       1         1
Impact          1           4       4         4             3           7       7         6
Body            1           2       2         3             1           (2-6)   (2-6)     (1-5)


The results of using RFE with the body attribute are presented in Table 2. When the body attribute is
used, we can observe a change in the ranks of the features in the dataset. The body attribute yields 10 features
when it is converted from textual to numerical values [40]-[42]. The random forest and decision trees
assigned ranks of 2 to 6 to all 10 body features. Random forests and decision trees produced a similar order
for the other features. Logistic regression assigned rank 1 to the body attribute.

Decision trees had a higher accuracy and better specificity and sensitivity than logistic regression and
random forest, both with and without the body attribute [43]-[45]. We carried out the chi-square feature
selection technique on the obtained dataset, with the number of features set to 3. Feature selection produced
a result similar to feature ranking: the features subcategory 1, subcategory 2, and business gave better
results than any other combination of features [46]-[48]. Table 3 and Figure 3 show the performance analysis
results of employing the chi-square feature selection technique on the predictive models. KNN has a better
accuracy of 89.22% than the other models when the body attribute is not used, and SVM/SVC had a better
accuracy of 86.86% than the others when it is used. Figure 3 shows the mean performance of these models.
To identify the best overall predictive model, we conducted a mean performance analysis in which the
average is calculated over accuracy, specificity, and sensitivity, both with and without the body
attribute [49], [50].


Table 3. Performance of chi-square with and without body attribute

                      Without body                            With body
Methods               Accuracy  Specificity  Sensitivity      Accuracy  Specificity  Sensitivity
Logistic Regression   87.45     95.65        98.51            82.52     92.74        96.31
SVC                   87.36     95.44        98.50            86.86     94.91        97.1
Random Forest         85.26     94.43        97.21            80.81     90.79        94.65
Decision Trees        85.41     94.88        97.56            81.61     91.56        95.88
Gaussian              87.03     95.03        98.04            82.03     92.15        96.23
KNN                   89.22     96.55        98.91            83.55     93.27        96.59
CNN                   98.43     97.56        98.56            98.32     98.56        98.45
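
The mean performance analysis can be reproduced from the values in Table 3 with a short sketch. It assumes
that the mean is the unweighted average of accuracy, specificity, and sensitivity over both settings, and that
the first block of Table 3 holds the results without the body attribute and the second block the results with
it; the variable names are illustrative.

```python
# (accuracy, specificity, sensitivity) copied from Table 3.
table_3 = {
    "Logistic Regression": {"without_body": (87.45, 95.65, 98.51), "with_body": (82.52, 92.74, 96.31)},
    "SVC": {"without_body": (87.36, 95.44, 98.50), "with_body": (86.86, 94.91, 97.1)},
    "Random Forest": {"without_body": (85.26, 94.43, 97.21), "with_body": (80.81, 90.79, 94.65)},
    "Decision Trees": {"without_body": (85.41, 94.88, 97.56), "with_body": (81.61, 91.56, 95.88)},
    "Gaussian": {"without_body": (87.03, 95.03, 98.04), "with_body": (82.03, 92.15, 96.23)},
    "KNN": {"without_body": (89.22, 96.55, 98.91), "with_body": (83.55, 93.27, 96.59)},
    "CNN": {"without_body": (98.43, 97.56, 98.56), "with_body": (98.32, 98.56, 98.45)},
}


def mean_performance(scores):
    """Average the three metrics over both settings (six values in total)."""
    values = scores["without_body"] + scores["with_body"]
    return sum(values) / len(values)


means = {method: mean_performance(scores) for method, scores in table_3.items()}
best_method = max(means, key=means.get)
for method, value in sorted(means.items(), key=lambda item: item[1], reverse=True):
    print(f"{method}: {value:.2f}")
```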


The CNN algorithm outperforms all the other algorithms, computing faster with better accuracy and F1
score. All the algorithms are trained with the training dataset to improve their performance. However,
the clustering algorithms provide a much lower MRR than the CNN algorithm. This reflects the inefficiency
of those algorithms in learning the features, and it takes time and quality data to improve their accuracy
further. The CNN algorithm, by contrast, provides better classification and MRR with the minimum amount
of data and features. The sample rate tuning of the algorithms is considered in this work. It can be seen in
Figure 3 that the CNN algorithm has the minimum sample rate tuning compared to the other clustering and
regression algorithms. From this, it can be concluded that the CNN algorithm provides efficient and accurate
results compared to the other algorithms and that it also outperforms them in terms of cost, resources, and
other metrics.


Figure 3. Mean performance of predictive models with chi-square


6. CONCLUSION 

As a result of our past work, the unlabeled ticket dataset has been clustered and labeled,
transforming it into a supervised dataset. In the retrieved dataset, only the body attribute is textual. In
this research, we conducted a performance analysis of feature ranking and feature selection
techniques (RFE and chi-square) combined with predictive models such as SVM/SVC, Gaussian naive
Bayes, decision trees, logistic regression, and KNN. For feature ranking, the decision tree algorithm
performed better than the random forest and logistic regression algorithms. The KNN algorithm
performed well without textual data when combined with chi-square. In the overall performance analysis of
the predictive models (with and without the body attribute) paired with the chi-square feature
selection technique, the CNN algorithm outperformed all other methods with a mean accuracy of 98.32%.